Auto-Sizing Neural Networks: With Applications to n-gram Language Models
Neural networks have been shown to improve performance across a range of
natural-language tasks. However, designing and training them can be
complicated. Frequently, researchers resort to repeated experimentation to pick
optimal settings. In this paper, we address the issue of choosing the correct
number of units in hidden layers. We introduce a method for automatically
adjusting network size by pruning out hidden units through $\ell_{\infty,1}$
and $\ell_{2,1}$ regularization. We apply this method to language modeling and
demonstrate its ability to correctly choose the number of hidden units while
maintaining perplexity. We also include these models in a machine translation
decoder and show that these smaller neural models maintain the significant
improvements of their unpruned versions. Comment: EMNLP 2015
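Below is a minimal, illustrative sketch of the general idea this abstract describes: penalize the incoming weights of each hidden unit with a group ($\ell_{2,1}$-style) regularizer and prune units whose weight rows collapse toward zero. It assumes PyTorch; the names l21_penalty and prune_hidden_units are invented for illustration, and this is not the authors' implementation.

```python
# Sketch of group regularization for auto-sizing a hidden layer
# (illustrative only; not the paper's code).
import torch
import torch.nn as nn

def l21_penalty(weight: torch.Tensor) -> torch.Tensor:
    # One group per hidden unit: the row of weights producing that unit.
    # Summing the groups' L2 norms encourages whole rows to shrink to zero.
    return weight.norm(p=2, dim=1).sum()

def prune_hidden_units(layer: nn.Linear, next_layer: nn.Linear, tol: float = 1e-3):
    # Keep only hidden units whose incoming weight row has non-trivial norm.
    keep = layer.weight.norm(p=2, dim=1) > tol
    pruned = nn.Linear(layer.in_features, int(keep.sum()))
    pruned.weight.data = layer.weight.data[keep]
    pruned.bias.data = layer.bias.data[keep]
    # The following layer must drop the matching input columns.
    shrunk_next = nn.Linear(int(keep.sum()), next_layer.out_features)
    shrunk_next.weight.data = next_layer.weight.data[:, keep]
    shrunk_next.bias.data = next_layer.bias.data
    return pruned, shrunk_next

# Training-loop fragment: add the group penalty to the task loss.
hidden, out = nn.Linear(64, 256), nn.Linear(256, 10)
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))
optim = torch.optim.SGD(list(hidden.parameters()) + list(out.parameters()), lr=0.1)
for _ in range(5):
    loss = nn.functional.cross_entropy(out(torch.relu(hidden(x))), y)
    loss = loss + 0.01 * l21_penalty(hidden.weight)
    optim.zero_grad()
    loss.backward()
    optim.step()
hidden, out = prune_hidden_units(hidden, out)
```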
Why Does Zero-Shot Cross-Lingual Generation Fail? An Explanation and a Solution
In zero-shot cross-lingual transfer, a multilingual model is trained to
perform a task in one language and then applied to another language.
Although the zero-shot cross-lingual transfer approach has achieved success in
various classification tasks, its performance on natural language generation
tasks falls short in quality and sometimes outputs an incorrect language. In
our study, we show that the fine-tuning process learns language-invariant
representations, which are beneficial for classification tasks but harmful for
generation tasks. Motivated by this, we propose a simple method to discourage
the model from learning language-invariant representations and a method to
select model checkpoints without a development set in the target language, both
resulting in better generation quality. Experiments on three semantically
diverse generation tasks show that our method reduces the accidental
translation problem by 68% and improves the ROUGE-L score by 1.5 on average. Comment: Findings of ACL 2023
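As a concrete illustration of the "accidental translation" problem this abstract mentions (generations in the wrong language), the sketch below estimates how often generated text is not in the intended target language, using the off-the-shelf langdetect package. This is only an assumption-laden illustration, not the paper's evaluation pipeline.

```python
# Estimate the "accidental translation" rate of generated outputs by checking
# whether each output is detected as the intended target language.
# Requires: pip install langdetect
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic

def accidental_translation_rate(outputs: list[str], target_lang: str) -> float:
    """Fraction of outputs whose detected language differs from target_lang."""
    wrong = 0
    for text in outputs:
        try:
            lang = detect(text)
        except Exception:
            lang = "unk"  # too short or undetectable
        wrong += int(lang != target_lang)
    return wrong / max(len(outputs), 1)

# Example: German generations, one of which drifted into English.
generations = ["Das Wetter ist heute schön.", "The weather is nice today."]
print(accidental_translation_rate(generations, target_lang="de"))  # expected: 0.5
```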
Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity
Mixture-of-experts (MoE) models that employ sparse activation have
demonstrated effectiveness in significantly increasing the number of parameters
while maintaining low computational requirements per token. However, recent
studies have established that MoE models are inherently parameter-inefficient
as the improvement in performance diminishes with an increasing number of
experts. We hypothesize that this parameter inefficiency results from all experts
having equal capacity, which may not adequately meet the varying complexity
requirements of different tokens or tasks. In light of this, we propose
Stratified Mixture of Experts (SMoE) models, which feature a stratified
structure and can assign dynamic capacity to different tokens. We demonstrate
the effectiveness of SMoE on three multilingual machine translation benchmarks,
containing 4, 15, and 94 language pairs, respectively. We show that SMoE
outperforms multiple state-of-the-art MoE models with the same or fewer
parameters. Comment: Accepted at Findings of EMNLP 2023
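The sketch below illustrates, under assumptions, the core idea of giving experts different capacities: a toy mixture-of-experts layer whose experts have different hidden widths and a top-1 router that sends each token to one of them. It assumes PyTorch; the class name VariableCapacityMoE and the routing details are invented for illustration and are not the paper's SMoE architecture.

```python
# Toy mixture-of-experts layer whose experts have *different* hidden widths,
# so the router can assign tokens to experts of different capacity.
# Illustrative only; not the paper's stratified SMoE design.
import torch
import torch.nn as nn

class VariableCapacityMoE(nn.Module):
    def __init__(self, d_model: int, expert_widths: list[int]):
        super().__init__()
        # Each expert is a feed-forward block with its own hidden width.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, w), nn.ReLU(), nn.Linear(w, d_model))
            for w in expert_widths
        ])
        self.router = nn.Linear(d_model, len(expert_widths))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Top-1 routing: each token goes to one expert.
        probs = self.router(x).softmax(dim=-1)       # (tokens, n_experts)
        chosen = probs.argmax(dim=-1)                # (tokens,)
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            mask = chosen == idx
            if mask.any():
                # Scale by the gate probability so the gate stays trainable.
                out[mask] = expert(x[mask]) * probs[mask, idx].unsqueeze(-1)
        return out

# Usage: three "strata" of experts with increasing capacity.
layer = VariableCapacityMoE(d_model=32, expert_widths=[64, 128, 256])
tokens = torch.randn(10, 32)
print(layer(tokens).shape)  # torch.Size([10, 32])
```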
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules
Incorporating language-specific (LS) modules is a proven method to boost
performance in multilingual machine translation. This approach bears similarity
to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the
scalability of this approach to hundreds of languages (experts) tends to be
unmanageable due to the prohibitive number of parameters introduced by
full-rank matrices in fully-connected layers. In this work, we introduce the
Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS
modules by generating low-rank matrices from two significantly smaller matrices
to approximate the full-rank matrix. Furthermore, we condense multilingual
knowledge from multiple LS modules into a single shared module with the Fuse
Distillation (FD) technique to improve the efficiency of inference and model
serialization. We show that our LMS method significantly outperforms previous
LS methods and MoE methods with the same number of extra parameters, e.g., 1.73
BLEU points over the Switch Transformer on many-to-many multilingual machine
translation. Importantly, LMS achieves comparable translation performance with
far fewer parameters. Comment: Accepted at the main conference of EMNLP 2023
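As a rough illustration of the low-rank idea behind LMS, the sketch below adds, for each language, the product of two small matrices on top of a shared linear layer instead of a full-rank language-specific matrix. It assumes PyTorch; the class and parameter names are invented for illustration, this is not the released LMS code, and the Fuse Distillation step is not shown.

```python
# Language-specific modules built from low-rank products: each language adds
# A_lang @ B_lang (two small matrices) to a shared full-rank weight, instead
# of storing a full per-language matrix. Illustrative sketch only.
import torch
import torch.nn as nn

class LowRankLanguageSpecificLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int, languages: list[str], rank: int = 8):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)  # shared across all languages
        # Two small matrices per language; their product approximates a
        # full-rank (d_out x d_in) language-specific matrix.
        self.A = nn.ParameterDict({l: nn.Parameter(torch.randn(d_out, rank) * 0.01)
                                   for l in languages})
        self.B = nn.ParameterDict({l: nn.Parameter(torch.randn(rank, d_in) * 0.01)
                                   for l in languages})

    def forward(self, x: torch.Tensor, lang: str) -> torch.Tensor:
        # Low-rank language-specific update added to the shared projection.
        ls_weight = self.A[lang] @ self.B[lang]        # (d_out, d_in)
        return self.shared(x) + x @ ls_weight.t()

# Each language costs roughly rank * (d_in + d_out) parameters
# instead of d_in * d_out for a full language-specific matrix.
layer = LowRankLanguageSpecificLinear(d_in=512, d_out=512, languages=["de", "fr"])
x = torch.randn(4, 512)
print(layer(x, lang="de").shape)  # torch.Size([4, 512])
```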